## Parsed with column specification:
## cols(
## Number = col_character(),
## Name = col_character(),
## Latitude = col_double(),
## Longitude = col_double(),
## `Total docks` = col_double()
## )
## Parsed with column specification:
## cols(
## `Start date` = col_character(),
## `Start station` = col_character(),
## `Start station number` = col_character(),
## `End date` = col_character(),
## `End station` = col_character(),
## `End station number` = col_character(),
## `Account type` = col_character(),
## `Total duration (Seconds)` = col_double()
## )
Are more rides taken where the concentration of Nice Ride stations is higher?
## # A tibble: 6 x 2
## trip_name n
## <chr> <int>
## 1 Lake Street & Knox Ave S to Lake Street & Knox Ave S 4853
## 2 Lake Calhoun Center to Lake Calhoun Center 3171
## 3 Lake Harriet Bandshell to Lake Harriet Bandshell 2650
## 4 Lake Como Pavilion to Lake Como Pavilion 2156
## 5 W 36th Street & W Calhoun Parkway to W 36th Street & W Calhoun Par… 2037
## 6 Willey Hall to Weisman Art Museum 1857
Above is a table of the six most frequent trips in the data set. The five most frequent trips have the same start and end station. For example, the most frequent trip, with a total of 4853 rides, runs from Lake Street & Knox Ave S back to Lake Street & Knox Ave S. This is followed by 3171 rides from Lake Calhoun Center back to Lake Calhoun Center. This data is interesting because it can be used to understand how consumers use the ride service: it is apparent that most individuals start and return to the same station. This matters because, while particular stations may be more popular than others, bikes are also returned to these stations. So when determining how many docks are needed at popular locations, the provider must consider not only how popular a station is but also how frequently customers return bikes there. If a particular location is a common return point, fewer docks will be needed. The rate at which bikes return to a particular location can be modeled with a Poisson process in order to maximize efficiency. The sixth most popular trip deviates from this pattern, as its start and end locations differ: there were 1857 rides from Willey Hall to Weisman Art Museum, so this trip may represent a destination trip rather than a return trip. By classifying rides as destination trips (the bike starts at one station and is dropped off at another) or return trips (the customer starts and ends at the same location), the provider can better estimate how many docks are needed at each location.
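The Poisson idea above can be sketched in a few lines. This is a hypothetical example, not code from the analysis: the column names follow the specification printed earlier, but the station name is just an illustration and the datetime format of `End date` is an assumption.

```r
library(dplyr)
library(lubridate)

# Sketch: estimate the hourly return rate at one station, then size the
# docks to cover the 95th percentile of hourly returns.
hourly_returns <- trips %>%
  filter(`End station` == "Lake Street & Knox Ave S") %>%
  mutate(hour = floor_date(mdy_hm(`End date`), "hour")) %>%  # format assumed
  count(hour)

lambda_hat   <- mean(hourly_returns$n)    # estimated Poisson rate (returns/hour)
docks_needed <- qpois(0.95, lambda_hat)   # docks covering ~95% of hours
```

A refinement would be to estimate a separate rate per hour of day, since returns are unlikely to be uniform across the day.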
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
For my plot, I used the left_join function twice to get the start and end coordinates into one tibble, joining once by the start station number and once by the end station number. I then created new columns with the displacement in each of the x and y coordinates and multiplied them by 69 to convert degrees to miles, then used the Pythagorean formula to calculate the total displacement. I did not take the curvature of the earth into account because it is negligible at this scale. I then plotted displacement on the y axis and trip duration on the x axis to see how the average displacement varied with duration. I found that for the first 30 minutes the average displacement increased with time, but then it went back down and leveled off. This is likely because people who take trips of 30 minutes or less tend to make one-way trips, while people who take longer trips tend to make a round trip and bring the bike back closer to where they started, so the total distance traveled is farther but the displacement is shorter.
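The displacement computation described above can be sketched as follows. The `stations` and `trips` names and the join keys are assumptions; the columns follow the specifications printed earlier.

```r
library(dplyr)

# Join station coordinates onto each trip twice (start, then end),
# renaming between joins so the second join's columns don't collide.
rides <- trips %>%
  left_join(stations, by = c("Start station number" = "Number")) %>%
  rename(start_lat = Latitude, start_lon = Longitude) %>%
  left_join(stations, by = c("End station number" = "Number")) %>%
  rename(end_lat = Latitude, end_lon = Longitude) %>%
  mutate(
    dx = (end_lon - start_lon) * 69,   # rough degrees-to-miles conversion
    dy = (end_lat - start_lat) * 69,
    displacement = sqrt(dx^2 + dy^2)   # Pythagorean formula
  )
```

One possible refinement: a degree of longitude at Minneapolis's latitude is shorter than a degree of latitude (by roughly cos(45°)), so scaling `dx` separately would be slightly more accurate.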
I looked at rides on popular holidays (Thanksgiving and Christmas) to see whether more or fewer rides were taken on those days, in order to determine whether more bikes are needed during the holidays. According to the Minneapolis bike maps for Thanksgiving and Christmas, there are fewer riders during these two holidays. This makes sense because most people stay in with their families during these times of the year. Additionally, these two holidays fall during the winter season, so fewer people may want to bike due to the cold weather (especially in Minneapolis).
## Joining, by = "day_of_week"
The following tables show the breakdown of member vs. casual riders among the 1000 longest and 1000 shortest rides in the data set. Note that the short-ride counts total more than 1000; this is because many of the short ride durations are tied, and the 1000 shortest rides here refers to the 1000 shortest ride durations.
## # A tibble: 2 x 2
## Account `Long ride count`
## <chr> <int>
## 1 Casual 933
## 2 Member 67
## # A tibble: 2 x 2
## Account `Short ride count`
## <chr> <int>
## 1 Casual 135
## 2 Member 930
One finding from these tables is that the longest rides in the set were primarily from casual riders, whereas the opposite is true for the shortest rides in the set. Thus, it may be true that shorter rides are more likely to be member riders, and longer rides are more likely to be casual riders.
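The tables above can be produced with `top_n()`, which keeps ties at the cutoff (this is exactly why the short-ride counts sum to more than 1000). A sketch, assuming a `trips` tibble with the columns from the specification printed earlier:

```r
library(dplyr)

# 1000 longest rides (plus any ties), counted by account type.
long_counts <- trips %>%
  top_n(1000, `Total duration (Seconds)`) %>%
  count(`Account type`, name = "Long ride count")

# 1000 shortest rides; a negative n takes the smallest values,
# and tied durations at the cutoff are all kept.
short_counts <- trips %>%
  top_n(-1000, `Total duration (Seconds)`) %>%
  count(`Account type`, name = "Short ride count")
```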
One discrepancy exists in short rides that start and end at the same station. Upon inspection of the data, it appears that casual riders are twice as likely to start and end at the same station. Ruling these cases out of the shortest-ride data would bias the representation against casual riders.
## # A tibble: 3 x 2
## Account n
## <chr> <int>
## 1 Casual 39665
## 2 Inconnu 2
## 3 Member 15639
Next, I investigated which days of the week the longest and shortest rides occurred on. The counts are summarised in the following chart.
The bars show the distribution of the longest rides in the data set. Clearly, the longest rides are more likely to happen on weekend days, with a prominent peak on Saturday. This is congruent with our intuition that individuals would be more likely to take longer rides on the weekend. In contrast, the shortest rides in the data set (represented by the black lines) are much more consistent across the work week, and decline slightly on the weekends.
We can conclude from this chart that the longest rides in the dataset occur on the weekends, whereas the shortest rides in the dataset are more evenly distributed throughout the week, with a small decline on the weekends.
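A chart like this can be built by labeling each ride's weekday and counting. This is a sketch, not the original code: the `start_datetime` column and the `group` label (longest vs. shortest) are assumed to have been created beforehand.

```r
library(dplyr)
library(ggplot2)
library(lubridate)

# Count long and short rides per weekday, then plot side-by-side bars.
by_day <- rides %>%
  mutate(day_of_week = wday(start_datetime, label = TRUE)) %>%
  count(group, day_of_week)

ggplot(by_day, aes(day_of_week, n, fill = group)) +
  geom_col(position = "dodge") +
  labs(x = "Day of week", y = "Ride count",
       title = "Longest vs. shortest rides by day of week")
```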
Finally, we can plot the 10 longest and 10 shortest rides over a map of Minneapolis.
Again, it’s easy to see how the short rides often start and end in the same location. Furthermore, most of the shortest rides occur near bodies of water in the area, and are generally clustered toward the center of the city. On the other hand, the longest rides in the set unsurprisingly cover larger distances, and are less centralized. While some of the distances are relatively short, the longer rides on this plot are generally farther apart.
This plot usefully shows the popular locations of the shortest and longest rides in the set.
For this lab, I sought to answer the question of which trips are the most popular, in order to learn how many docks should be at each location. To perform this analysis, I made use of several tidyverse functions. I initially used the read_csv function to import the respective data sets. I then renamed the Start station number column in one of the tibbles to Number so that the column titles would match and I could perform a left join; I joined the data sets on the station Number, using a left join because I only wanted longitudes and latitudes for stations where rides occurred. I performed another left join to get the latitude and longitude of the End station, then selected only the columns needed for my analysis. I mutated my new tibble to create a trip_name column by concatenating Start station and End station, dropped the NA values, and separated the Start and End date columns so that each became two columns corresponding to Date and Time. Finally, I used the pipe to count the trip_name column and arrange in descending order of n to show the most frequent trips.
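The pipeline described above can be sketched end to end. The file names, the `" "` separator for the date columns, and the `stations`/`trips` object names are assumptions; everything else follows the steps as described.

```r
library(tidyverse)

stations <- read_csv("nice_ride_stations.csv")      # filename assumed
trips    <- read_csv("nice_ride_trip_history.csv")  # filename assumed

top_trips <- trips %>%
  rename(Number = `Start station number`) %>%
  left_join(stations, by = "Number") %>%                            # start coords
  left_join(stations, by = c("End station number" = "Number")) %>%  # end coords
  select(`Start station`, `End station`, `Start date`, `End date`) %>%
  mutate(trip_name = str_c(`Start station`, " to ", `End station`)) %>%
  drop_na() %>%
  separate(`Start date`, into = c("Start Date", "Start Time"), sep = " ") %>%
  separate(`End date`,   into = c("End Date", "End Time"),     sep = " ") %>%
  count(trip_name) %>%
  arrange(desc(n))
```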
For this lab, I wanted to demonstrate that I could use the join functions properly, and I did that by using joins to attach the latitude and longitude values to the start and end stations for every trip. When joining, I selected only the columns relevant to the question. Using this data, I then made a scatter plot showing the relationship between the displacement and the duration of each trip.
For this lab, I answered the question of whether more or fewer bikes are needed during the holidays. I created maps with arrows representing people's bike paths during Thanksgiving and Christmas, two of the most popular holidays, customizing the maps with specific arrows, colors, labels, and axes. I used filter to create the specific maps for Thanksgiving and Christmas and used autoplot to draw them. I also created the team plot to answer the overall team question. I found that more stations are concentrated within the hearts of cities such as Minneapolis and St. Paul, which makes sense since their populations are higher. According to findings made by my group, a couple of streets within these cities get the most rides. Therefore, a possible solution may be to add more bikes to these stations so more people can take advantage of them.
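The holiday filter step can be sketched as below. The date strings and their format are placeholders (the raw data stores dates as character, and the exact format and year are not shown here), so they would need to be adjusted to match the actual data.

```r
library(dplyr)

# Keep only rides that started on each holiday; date format is assumed.
thanksgiving_rides <- rides %>%
  filter(`Start Date` == "11/23/2017")  # example date, adjust to the data

christmas_rides <- rides %>%
  filter(`Start Date` == "12/25/2017")  # example date, adjust to the data
```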
For this lab, I created and formatted the team document. I also worked to override defaults on my computer in order to load the OpenStreetMap library and compile the document for my group. In addition, I completed my individual analysis and reported several findings about the data. Using the tidyverse, I read in the data and performed mutating joins to combine data from the two tables. I used the select, filter, and rename functions to simplify my data frames, and the mutate, top_n(), and count functions to create variables with which I could evaluate the characteristics of the longest and shortest rides in the dataset. Finally, I used functions such as scale_x_discrete() and labs() to customize my charts and plots.